The Virtual Memory in the STRETCH Computer

JOHN COCKE AND HARWOOD G. KOLSKY†



EARLY in the planning of the STRETCH com- 
puter it was seen that by using the latest solid 
state components in sophisticated circuits it 
would be possible to increase the speed of floating 
point arithmetic by almost two orders of magnitude
over that in existing computers. However, there 
seemed to be no possibility of developing on the same 
time-scale economically feasible large memories with 
more than a factor of ten or perhaps twenty increase 
in speed. As a result, the proposed system appeared 
to be in danger of being seriously memory-access 
limited. 

Moreover, as the speed of the floating point opera- 
tions increases, a larger and larger percentage of the 
computer's time is spent on "parasitic operations",
i.e., operations whose only function is program con- 
trol and data selection. It was obvious that a radically 
new machine organization was necessary in order to 
capitalize upon the possibilities opened up by the 
high arithmetic speeds in the presence of relatively 
slow memories. 

At this time, a number of persons were considering 
the possibility of a "look-ahead" device in which an 
independent indexing arithmetic unit would prepare 
the effective addresses of instructions and initiate 
memory references to a multiplicity of memory boxes. 
The data thus fetched would be held in high-speed 



This device would serve two desirable purposes: (1) 
some of the parasitic operations would be done in 
parallel and thus not delay the principal calculations, 
and (2) several memory boxes could be running 
simultaneously, giving the effect of higher memory 
speed. 

Since our original work on the virtual memory and 
simulation in 1957-58, a large number of detailed 
changes have been made in the actual hardware 
design of STRETCH. These necessitated several 
modifications in the simulation program to estimate 




Fig. 1 — Schematic of Stretch computer. 

† International Business Machines Corporation, Poughkeepsie,
New York. 
this report we are omitting many of these changes for
expository reasons, since our purpose is to describe 
the virtual memory and timing simulation concepts, 
not to describe the STRETCH hardware exactly. 
The result is that the system described below em- 
bodies a more general system than that found in the 
simulator, which in turn is more general than that 
found in the actual computer. 

General Description of the System 

The major logically-independent blocks of the 
STRETCH computer are shown in Fig. 1. Each of 
the units pictured may be considered as operating 
asynchronously. That is, each does its tasks as fast as 
possible independently of the others. In theory, each 
box could have its own clocking circuits and still 
operate properly. In practice, for economy's sake they 
are all timed by the same master oscillator, but this 
does not destroy their logical independence. 

The bus control unit serves as a routing agent
between the memories and the various data processing
units. If two or more units make a request
simultaneously, the control unit assigns priorities in the
following order: (1) High-speed Exchange, (2) Basic
Exchange, (3) Virtual Memory, and (4) Indexing
Arithmetic Unit.
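
The priority rule can be illustrated with a minimal sketch. The unit names and the `grant` function are hypothetical labels chosen for this illustration; only the ordering itself comes from the text.

```python
# Priority order as stated in the text, highest first.
# The identifiers are illustrative names, not taken from the STRETCH design.
PRIORITY = [
    "high_speed_exchange",
    "basic_exchange",
    "virtual_memory",
    "indexing_arithmetic_unit",
]


def grant(requests):
    """Return the highest-priority unit among simultaneous requests, or None."""
    for unit in PRIORITY:
        if unit in requests:
            return unit
    return None
```

For example, if the Virtual Memory and the Indexing Arithmetic Unit request the bus in the same cycle, `grant` selects the Virtual Memory.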

The Indexing Arithmetic Unit fetches instructions, 
performs all necessary indexing operations and sends 
the instructions to be executed to the Virtual Memory. 

The Virtual Memory fetches and receives the data 
required by the instruction and holds this data until 
the arithmetic unit is ready for it. The Virtual 
Memory also performs all store operations. It holds 
the data generated by the arithmetic unit or index- 
ing arithmetic unit until the memory to which the 
data must be sent is available. Thus the virtual 
memory acts not only as a "look-ahead" for instruc- 
tions to be fed to the arithmetic unit, but also acts as 
a "look-behind" storage buffer.

The actual design of such a "look-ahead" device 
posed a number of logical problems, particularly in 
connection with conditional branches. However, a 
machine organization of this complexity requires a 
detailed timing analysis in order to determine the 
value of adding hardware in the form of the virtual
memory. This is especially true since the sole function
of the virtual memory is to increase machine speed,
by increasing the efficiency of other devices. It was
also felt that the timing analysis could not be made
on the basis of a few trivial examples (e.g. matrix
multiply). Machine performance obtained in this
fashion can be extremely deceptive. Since a detailed 
timing analysis of a computer of this complexity is 
extremely tedious to carry out by hand, it became 
clear that if the job were to be done, it would be 
necessary to simulate the proposed machine on 
another computer. This prompted us to write the 
simulation program to be described later. 

With the above general organization in mind, let 
us discuss some of the logical problems posed by such 
a system. The first problem is a result of the very 
concept which enables us to obtain such great bene- 
fits from the stored program computer — the ability 
to treat instructions as data. In a system such as we 
have proposed there is a large amount of simultaneous 
operation. For example, the indexing arithmetic unit 
may be busy preparing an instruction before previous
instructions have been completed or even started by 
the arithmetic unit. One of these previous instructions 
may modify the instruction which is presently being 
indexed. The virtual memory must recognize this 
situation and allow the intervening instructions to be 
completed before doing the modified instruction. 

A similar problem exists with respect to ordinary 
data. In order to operate several memories simul- 
taneously, it is necessary to start obtaining data from 
these memories before the preceding operations have 
been completed. Yet, one of these operations may be 
a store into one of the data locations. The virtual 
memory must make provisions to insure that each 
instruction obtains the most up-to-date data as 
implied by the order of the program. 

One of the novel features of the STRETCH com- 
puter is its elaborate interrupt system. Under this 
system, whenever some unexpected occurrence arises, 
the program will be interrupted and control will pass 
to a special routine which is designed to take care 
of the case in question, then return control to the 
original program. In this situation the virtual memory 
must have provisions to retain enough information so 
that when an interrupt occurs we can resume the 
computation exactly where we left off. It must be 
able to recognize which of the changes that have been
made in advance are not desired and should be 
obliterated, and which are exact solutions that must 
be restored. 

Another special case arises when a conditional 
branch on arithmetic results occurs. Here we will not 
know which of the two branches we should have taken 
until the preceding instruction is executed. In the 
case where the wrong path has been selected, the 
virtual memory must be prepared to drop the inter- 
mediate results which have been computed and pick 
up the correct branch in a way very similar to that 
of an interrupt. 

Summing up all these logical problems, we may
state that the fundamental rule for the virtual
memory is that it must make the asynchronous and
non-sequential computer give results identical to
those which would be obtained by performing the
program one instruction at a time in the order in
which they are written.



Definitions 



Operations 



Operations are considered to be of three types:

(1) Bring or Fetch Type — All instructions re- 
quiring data to be transmitted from external 
memory to the virtual memory. 

(2) Store Type — Instructions requiring the
transmission of data from the virtual memory
to external memory or index memory.
(Note: We consider all indexing instructions
to be of the store type although the store
may be to either external memory or index
memory.)

(3) Immediate Type — All operations not
requiring data transmission.

Virtual Memory Quantities 

(1) Virtual Memory — A number of virtual
memory (or look-ahead) levels (numbered
0 to N - 1).

(2) Level of Virtual Memory — A collection of 
registers and control bits. The contents of the 
jth level are shown in Fig. 2. 

(3) Instruction Address Register (IAj) — Contains
the address of the instruction currently in the
jth level.

(4) Operation Code Register (OPj) — Contains
the operation to be performed by the
arithmetic unit.

(5) Store Bit (Sj) — A one-bit trigger which
indicates the level contains a store type
instruction.

(6) Bring Bit (Bj) — A one-bit trigger which
indicates the level contains a fetch type
instruction for which the data access has not
been started.

(7) Forwarding Bit (Fj) — A one-bit trigger
which indicates that the jth level must
transmit data to another level.

(8) Forwarding Address (FAj) — A register which
contains the number of the level to which the
data must be sent if Fj is set.

Fig. 2 — Virtual memory — contents of one level.
(3) Interlock three (I3): C1 = C2. Similar to I2,
prevents a fetch from being initiated.

(4) Interlock four (I4): C1 = C4. Prevents the
arithmetic unit from executing the next instruction.

(5) Interlock five (I5): C1 = C4 + N. Prevents the
IAU from placing the next instruction into the
level indicated by C1 because the instruction
there has not been executed yet.

Logic of the Virtual Memory 

There are two basic precepts which must be kept 
in mind to understand the operation of the virtual 
memory : 

(1) The OK bit (OKj) being set in the jth level
indicates that the contents of Dj is the correct
data called for by DAj. All operations will be
performed only under this condition, and
logical decisions will be made in such a manner
as to make sure this is the case.

(2) Addresses can be compared by the IAU with
every DAj address simultaneously. DAj is not
used for any level which does not have its Cj
bit set. If a comparison exists between a new
DAj being placed in the virtual memory and
an old DAk, the compare bit Ck is turned off
and the address of level j is placed in FAk.
This insures a unique meaning for the
comparison. If this were not done, another
instruction address DAe might compare against two
levels and thus cause an ambiguity.

Instruction Fetch Logic

Fig. 4 is a flow diagram of the IAU Instruction
Fetch Procedure. The logic is as follows: If the IAU
is ready to fetch another instruction, it compares the
instruction address with all the DAj's of virtual
memory. If there is no comparison, the instruction
fetch is initiated. If there is a comparison, the IAU



INTERLOCKS I4 AND Ij ARE AS SHOWN, THE OTHER INTERLOCKS 
ARE DONE IN A SIMILAR MANNER. 

Fig. 3 — Virtual memory interlocks. 

(9) OK Bit (OKj) — A trigger which when set
indicates that the correct data for the
instruction to be executed is present in the jth data
field.

(10) Data Field (Dj) — A register which contains
the operand data for the instruction.

(11) Data Address (DAj) — The operand data
address (already indexed by the IAU) for Dj.

(12) Compare Bit (Cj) — A trigger which if not set
indicates the address in DAj should not be
included in any address comparisons being
made.
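
The registers and control bits listed in items (3) through (12) can be gathered into one record per level. The following is a minimal sketch; the Python field names, types, and default values are assumptions made for illustration and are not taken from the STRETCH hardware.

```python
from dataclasses import dataclass


@dataclass
class Level:
    """One look-ahead level; field names mirror the symbols in the text."""
    ia: int = 0        # IAj: address of the instruction held in this level
    op: str = ""       # OPj: operation for the arithmetic unit
    s: bool = False    # Sj: level holds a store type instruction
    b: bool = False    # Bj: fetch type instruction whose access has not started
    f: bool = False    # Fj: data must be forwarded to another level
    fa: int = 0        # FAj: level number to forward to when Fj is set
    ok: bool = False   # OKj: Dj holds the correct data
    d: int = 0         # Dj: operand data
    da: int = 0        # DAj: operand address, already indexed
    c: bool = False    # Cj: DAj takes part in address comparisons


def make_virtual_memory(n):
    """Build N independent levels, numbered 0 to N - 1."""
    return [Level() for _ in range(n)]
```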

Counters 

The virtual memory is controlled by a set of
counters which count mod N, where N is the number
of virtual memory levels.

(1) Counter one (C1) — Indicates the level into
which the next instruction may be placed.

(2) Counter two (C2) — Indicates the level from
which the next bring type instruction may be
initiated.

(3) Counter three (C3) — Indicates the level from
which the next store type instruction may be
initiated.

(4) Counter four (C4) — Indicates the level from
which the arithmetic unit will get its next
operation and data.

Interlocks 

The above counters must be interlocked in the
following manner to assure proper sequential
operation of the computer (see Fig. 3):

(1) Interlock one (I1): C1 = C3 + N. Prevents
the IAU from placing the next operation into
the level indicated by C1 because an unexecuted
store is still in the level.

(2) Interlock two (I2): C1 = C3. Prevents a store
from being initiated from the level indicated
by C3 because the store has already been done.
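
The five interlock conditions can be written as simple comparisons. This is a minimal sketch under one stated assumption: each counter is kept here as a free-running integer whose value mod N selects the level, so that "C1 = C3 + N" (the ring is full) can be told apart from "C1 = C3" (the ring is empty). How the hardware distinguishes the two cases is not described in the text.

```python
def interlocks(c1, c2, c3, c4, n):
    """Evaluate the five interlock conditions for N levels.

    Assumption: counters are free-running integers; the level addressed
    by a counter is its value mod n.
    """
    return {
        "I1": c1 == c3 + n,  # IAU blocked: unexecuted store still in level C1
        "I2": c1 == c3,      # no store may be initiated from level C3
        "I3": c1 == c2,      # no fetch may be initiated from level C2
        "I4": c1 == c4,      # arithmetic unit has nothing to take from level C4
        "I5": c1 == c4 + n,  # IAU blocked: level C1 holds an unexecuted instruction
    }
```

With four levels and all counters at zero, I2, I3, and I4 hold and the IAU is free to load; once four instructions have been placed and none executed, I5 holds.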
Fig. 4 — Instruction fetch procedure.

must take its instruction from the virtual memory
provided the OK bit is set; otherwise, it must wait 
until the OK bit is set. 

Note: This procedure prevents the logical difficulty
mentioned earlier which would occur if the virtual
memory contained a store order into the instruction
presently being fetched.

For example:

    a       STORE Address a + 2
    a + 1   LOAD M, i
    a + 2   ADD N, i
    a + 3

The store to a + 2 must be done in sequence or the
old value N would be used for the address instead of
the quantity being set by a.
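
The instruction fetch decision described above can be sketched as follows. The dictionary keys and the returned action names are hypothetical; the sketch only shows the three outcomes named in the text (fetch from memory, take the instruction from the virtual memory, or wait for the OK bit).

```python
def instruction_fetch(vm, address):
    """Decide how the IAU obtains the instruction at `address`.

    `vm` is a list of levels, each a dict with keys "C" (compare bit),
    "DA" (data address), "OK" (OK bit) and "D" (data field).
    """
    for level in vm:
        # Only levels whose compare bit is set take part in the comparison.
        if level["C"] and level["DA"] == address:
            if level["OK"]:
                return ("take_from_vm", level["D"])
            return ("wait", None)
    return ("fetch_from_memory", None)
```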

Indexing Logic 

Fig. 5 shows the flow for instruction indexing. After
determining that an instruction is ready to be
indexed, the IAU tests whether or not the index value
is available. If it is, the indexing operation is started;
if not, the memory reference is started and the IAU
waits until the data returns before proceeding. If the
index-fetch has not been started, the IAU compares
the index address against all the data addresses in
virtual memory. If none compare, the index value is
fetched normally. If one does compare, the index
fetch is held up until the OK bit is set for the data.
This value from the virtual memory is then used for
indexing the instruction.
Fig. 5 — Indexing procedure.

Logic of Putting Instructions in the Virtual Memory 

(1) Figs. 6, 6A, 6B, 6C represent the logical flow
for putting instructions into the virtual
memory. If the indexing arithmetic unit has
an instruction prepared for the virtual
memory, it may transmit the instruction into the
virtual memory if interlocks one and five do
not forbid it. These interlocks prohibit a new
instruction from destroying an old one which
has not been executed as yet, whether an
arithmetic operation (I5) or an unexecuted
store (I1). The handling of the instructions
varies depending on whether they are of the
bring type, store type, or immediate type.

(2) The bring type, as described in Fig. 6A,
proceeds as follows: If the effective data address
of the instruction compares with the DA
address in some level, the instruction, its op
code, and effective data address are loaded
into the level marked by C1. The compare bit
for level C1 is set to one while the compare bit
for the compared-with level is set to zero. If
the OK bit in this compared-with level is set,
meaning that the data located there is correct,
the data is transmitted directly to the C1 level
and its OK bit is also set. If the OK bit is not
set, we must tag the compared-with level by
setting its forwarding bit and by putting the
value of C1 into its forwarding address; the
bring bit for level C1 is also set to zero since no
further data fetch is required.

If the effective data address does not compare
with any Virtual Memory level, the
instruction is put directly into level C1, its OK bit is
set to zero, and its bring bit is set to one,
indicating that a fetch must be started.
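
The bring type rules in item (2) can be traced with a small sketch. Several details are assumptions made for this illustration and are marked in the comments: levels are plain dictionaries, the level being loaded is excluded from the comparison, and the store and forwarding bits of the new level are cleared on loading.

```python
def empty_level():
    # Keys mirror the symbols in the text (IA, OP, DA, D, FA and the bits).
    return {"IA": 0, "OP": "", "DA": None, "D": None, "FA": None,
            "S": False, "B": False, "F": False, "OK": False, "C": False}


def place_bring(vm, c1, ia, op, da):
    """Load a bring type instruction into level c1 of vm."""
    # Assumption: the level about to be overwritten does not take part
    # in the address comparison.
    match = next((k for k, lv in enumerate(vm)
                  if k != c1 and lv["C"] and lv["DA"] == da), None)
    new = vm[c1]
    # Assumption: S and F of the newly loaded level are cleared.
    new.update(IA=ia, OP=op, DA=da, C=True, S=False, F=False)
    if match is None:
        new.update(OK=False, B=True)        # a fetch must be started
        return
    old = vm[match]
    old["C"] = False                        # keep comparisons unique
    new["B"] = False                        # no further fetch required
    if old["OK"]:
        new["D"], new["OK"] = old["D"], True    # copy the data directly
    else:
        new["OK"] = False
        old["F"], old["FA"] = True, c1      # forward when the data arrives
```

Loading two instructions with the same data address in succession leaves the first level tagged to forward its data to the second, so only one memory reference is made.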

(3) Fig. 6B shows the store type procedure. If the
effective address of the instruction does not
compare with the DA address in some level,
the instruction is placed into the level marked
by C1. The store bit is set to one indicating
that a store will be required. The level's bring
bit and forwarding bit are set to zero; its
compare bit is set to one. If on the other hand
the addresses do compare, the same procedure
is followed; but in addition, the compare bit
in the level compared-with is set to zero so
that future comparisons will be unique.

The OK bit has not yet been set. It is set to
one if the operation is an index store and set
to zero if it is an ordinary store. For the
ordinary store it is clear that the OK bit should be
zero since the data must come from the
arithmetic unit after the preceding instruction is
executed.

As was mentioned in the definition previously 
we treat all indexing instructions as store 
type and place the new value of the indexed 
quantity into the virtual memory. This is 
done because the indexing arithmetic unit is 
going ahead of the normal order of instruction 
execution and an interruption may occur 
before this indexing instruction should have 
Fig. 6 — Procedure for placing instructions
into the virtual memory.

Fig. 6(a) — Logical conditions for bring type operations.

been done. In this case, the old value of the
index is still in the index register. On the other
hand the indexing arithmetic unit compares
with the virtual memory and extracts the
most recent value of the index for indexing
succeeding instructions. The OK bit is set to
one since the appropriate data is in the above
level. Both the new and old index values must
be carried along to give logically correct
conditions in the case of an interrupt. A situation
very similar to interrupt occurs in branches on
arithmetic results where the indexing
arithmetic unit "guesses" which branch will be
taken and proceeds with fetching and
processing the instructions on this branch, subject to
being wiped out if the guess proves to be
wrong. (See the discussion on "Wrong way
Branches" below.)

(4) Immediate type instructions are the simplest
type because they essentially carry their data
with them. Fig. 6C shows the logic in this case.
Fig. 6(b) — Logical conditions for store type operations.

[Fig. 6(c) flow box, entered from Fig. 6: In the C1 level, put the
instruction address in IA, put the op code in OP, put the data
address into D (note this), set the OK bit to one. Set the
forwarding bit, the bring bit and the store bit to zero. Set the
compare bit to zero (note). Return to top of Fig. 6.]

Fig. 6(c) — Logical conditions for immediate type operations.

The instruction is placed in the virtual
memory level marked by C1. The address field
of the instruction is placed in the data field of
C1. The OK bit is set to one indicating the
data is present. The bring and store bits are
both set to zero. The compare bit is set to
zero since the DA address field has no
meaning for immediate type ops. (The data address
of the last instruction which occupied this
level still remains in DA, so it has no relation
to the present D field.)



[Fig. 7 flow diagram. Decision boxes: Does I3 prevent fetch? Is the
bring bit set for level C2? Is the bus free? Is memory free? If no,
wait. Action boxes: Start data fetch, set return address to level
C2, set bring bit for C2 to zero. Advance fetch counter (C2).]

Fig. 7 — Data fetch procedure.

Logic of Data Fetching

See Fig. 7: When an instruction of the bring type
has been placed in the virtual memory, the data
required by the instruction in general will not be present
(unless a comparison exists as was described above)
and thus the data must be obtained from core
storage. The fetch cannot be started if interlock I3 holds,
which means all the fetches corresponding to the
instructions presently in the virtual memory have
been started. If a fetch is possible, the bring bit at
level C2 indicates whether or not a fetch is necessary.
If necessary the fetch may be started if the memory
bus and memory unit corresponding to the data
address are not already being used. When the fetch is
started, the bring bit for level C2 is set to zero. The
fetch counter (C2) is then advanced.
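
The fetch decision can be summarized in a short sketch. It reflects one reading of the procedure: the counter C2 is advanced both after a fetch is started and when the level needs no fetch. The free-running counter, the dictionary layout, and the returned action names are assumptions made for illustration.

```python
def fetch_step(vm, c1, c2, bus_free, memory_free):
    """One pass through the data fetch decision.

    Returns (new_c2, action, return_level). Counters are assumed to be
    free-running integers; the level is the counter value mod len(vm).
    """
    if c2 == c1:                       # interlock I3: nothing left to fetch
        return c2, "wait", None
    index = c2 % len(vm)
    level = vm[index]
    if not level["B"]:                 # no fetch needed for this level
        return c2 + 1, "skip", None
    if not (bus_free and memory_free):
        return c2, "wait", None        # try again at the next time step
    level["B"] = False                 # fetch started; clear the bring bit
    return c2 + 1, "fetch", index      # index serves as the return address
```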
Logic of Data Storing

Fig. 8 shows the Data Store Logic, which is very 
similar to that for data fetching just described. The 
only significant difference is that the OK bit must be 
set before the operation can be started. 
Fig. 8 — Data store procedure.

Logic for Placing Data into the Virtual Memory 

In Fig. 9, we see the logical conditions which must 
be satisfied by the data returning from memory 
addressed to the virtual memory. The return address 
which was supplied when the fetch was started selects 
the level into which the data will be placed. The OK 
bit is then set to one, indicating that the proper data is 
in the level. The operation is complete at this point 
unless the forwarding bit is set. In this case, the data 
must be forwarded to the level designated by the 
forwarding address. This procedure continues from 
level to level as long as the data continues to arrive 
into a level whose forwarding bit is set. This procedure 
automatically supplies all operands present having
identical data addresses with the proper data, without 
additional memory references. 
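
The forwarding chain can be sketched in a few lines. The dictionary layout is hypothetical, and clearing the forwarding bit after it has been used is an assumption of this sketch; the text does not say when the bit is reset.

```python
def deliver(vm, level, data):
    """Place returning data in `level`, then follow the forwarding links.

    Returns the list of levels that received the data, in order.
    """
    visited = []
    while True:
        vm[level]["D"] = data
        vm[level]["OK"] = True          # proper data is now in the level
        visited.append(level)
        if not vm[level]["F"]:
            return visited
        vm[level]["F"] = False          # assumption: bit is cleared once used
        level = vm[level]["FA"]         # continue at the designated level
```

A single memory reference that returns to level 0 thus satisfies every later level chained to it through forwarding addresses.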
Fig. 9 — Procedure for placing data into virtual memory.

Logic of Removing Instructions from the Virtual 
Memory 

In Fig. 10, we notice that as the arithmetic unit 
completes an instruction it checks to see if the next 
instruction in the virtual memory is ready to be
executed (indicated by interlock I4). Note that the
operation may be an unconditional branch, a condi- 
tional branch, or an index type store, as well as a 
normal bring or store type instruction involving the 
accumulator. Fig. 10 shows only the cases which in- 
volve the universal accumulator. Instructions such as 
the unconditional branches are merely ignored at this 
point. They are carried along only to provide the data 
for recovery in the event an interrupt occurs. The 
execution of the conditional branches on arithmetic 
results are described in the next section. 

If the next instruction marked by counter C4 is
ready, it is fed into the arithmetic unit. If it is a store
type, the data is gated from the accumulator into the
data field of level C4, and the OK bit is set to one. If
the forwarding bit of the level is set, the data is
forwarded to the level designated by the forwarding
address. The forwarding procedure in this case is
essential for the proper logical operation of the
computer, whereas in the bring case it is a time-saver only.

If the instruction is not a store type, the arithmetic
unit must hold up until the OK bit for the level is set.
When the OK bit is set, the instruction is gated into
the arithmetic unit and executed.

Fig. 10 — Procedure for removing instructions
from virtual memory.

Logic of Interrupt Procedure 

If for any cause an interrupt (or trap) from a spe- 
cial condition occurs, the instruction which is being 
executed in the arithmetic unit is completed. How- 
ever, the next instruction is not executed in spite of 
the fact all the data preparation for it may have been 
completed. The address in the IA (instruction
address) field will serve as the value to reset the
instruction counter if it is desired.

The Virtual Memory is initialized, i.e., set to the 
starting conditions of an interrupt, with the excep- 
tion that all store orders which have already received 
data from the accumulators must be executed first. 
If the interrupt is of such a nature that the normal 
flow of instructions is not resumed, the procedure of 
storing the modified values of the index registers in 
the Virtual Memory gives logically correct results, 
i.e., the same as if the interrupt had occurred before 
the indexing took place. 

Description of Timing Simulation Program 

During the logical design of STRETCH it was
necessary to prove the value of the virtual memory
concept and to assist in the selection of optimum
values of various system design parameters.
Examples of such parameters are: the number of
memory boxes, interlace and allocation of memory
addresses, and numbers of virtual memory levels.
Also of interest were trade-off factors for speeds of
indexing arithmetic unit, memories, etc.

In November 1957 the Timing Simulator (SIM-2) 
described here was written for the IBM 704. This 
program attempted to answer such questions quan- 
titatively by simulating the time-wise operation of 
STRETCH on typical test programs coded in 
STRETCH language. 

The basic logic of the 704 program follows the 
principles just described in the preceding section for 



ponents. This means that there are a large number of
logical steps being executed at any one time in the
computer, each of them proceeding at its own rate.

operations, we have broken the continuous time
variable into finite time steps. The basic time step is
taken as 0.1 microsecond in the simulator.

By taking 0.1 microsecond as our quantum of time, 
we are automatically setting the scale of the smallest 
circuit entities which we will consider as being those 
which accomplish complete functions in 0.1 micro- 
second or few multiples thereof. Thus, by using this 
philosophy, and considering many of the components
of the computer as "black boxes", we greatly simplify 
the details which must be considered without intro- 
ducing serious timing inaccuracies. 

Our experience has indicated that more informa- 
tion was gained by making a large number of fast 
parameter studies using different configurations and 
programs than could have been obtained by a very 
slow, detailed simulation of a few runs with more 
precision per run. Even so, our time scale is too fine 
to make serious input-output application studies. 
These would require a simpler simulator having at 
least a factor of 10 coarser basic time interval. 



Logic of the Simulator 

In the asynchronous organization of STRETCH 
there can be many major components operating at 
any one time. To achieve this parallel effect in the 
simulator we essentially "hold time still" and scan 
the entire machine representation at each time step. 

Although every major block of the program is

traversed at each time step, if there is no activity 
required in a given block, only a few tests need be 
made by the code. 

If in this process it is determined that a given 
logical unit should do an operation, the time interval 
required for the operation is obtained from a table of 
constants. The speed of the various logical units can 
thus be changed parametrically by changing the 
values in the tables. A constant obtained from the 
tables is inserted into a memory location called the 
time counter for that unit. At each time step the 
program reduces this counter by one until it reaches 
zero. Thus, the fact that the counter is non-zero can 
be used to indicate that the particular logical unit is 
busy and not available to service other requests. 
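
The time-counter mechanism can be illustrated with a minimal sketch. The class and function names are hypothetical, the durations in `TABLE_US` are placeholder values chosen only for the example, and the real SIM-2 program was written for the IBM 704, not in Python.

```python
STEP_US = 0.1  # basic time step of the simulator, in microseconds

# Table of constants: operation times in microseconds.
# These values are placeholders for illustration, not STRETCH figures.
TABLE_US = {"memory": 2.0, "arithmetic": 0.6}


def to_steps(microseconds):
    """Convert a duration to a whole number of 0.1 microsecond steps."""
    return round(microseconds / STEP_US)


class Unit:
    """A logical unit that is busy while its time counter is non-zero."""

    def __init__(self, name):
        self.name = name
        self.time_counter = 0

    @property
    def busy(self):
        return self.time_counter > 0

    def start(self):
        # Load the counter from the table when an operation begins.
        self.time_counter = to_steps(TABLE_US[self.name])


def tick(units):
    """Advance simulated time by one step, counting every busy unit down."""
    for unit in units:
        if unit.time_counter > 0:
            unit.time_counter -= 1
```

Changing an entry in `TABLE_US` changes the speed of that unit for the whole run, which is the parametric variation the text describes.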
simulator is a timing simulator and does not execute
the instructions in an arithmetic sense. It traces the 
time-wise progress of the instructions through the 
components of the computer, observing all the inter- 
locks and time delays necessary for correct representa- 
tion of the behavior of the machine. 

One of the fundamental concepts in the STRETCH 
design is that of asynchronous operation of the com-



input. 

In addition to the time counters many of the 
logical blocks contain other conditions or interlocks 
which affect the operation of the block. These condi- 
tions are stored in the program and tested before 
action is undertaken. 

It is interesting to note that since the simulator 
simulates timing only, the sequence of instructions 
to be executed must be furnished as a "string" with
all loops unwound. However, to make the computer
behave as it actually would, the loops must be
furnished with "wrong way" paths given for the cases
where the computer would take such paths. Also one
must furnish more than enough information along
such paths since it is difficult to predict in advance
how far the computer will get down the wrong path
before it is called back.

Parameters are changed from one run to another 
by use of control cards. The control cards are set up 
in such a way that any number of parameters may be 
changed between runs. Results are given either as 
detailed timing charts or as summary listings for each 
problem. The usual procedure has been to print only 
summary results while making a series of parameter 
studies. The detailed timing charts as printed on the 
704 for most problems would be about 50 feet long 
for each run. Since over 1000 cases have been run, it 
is clear that only a few cases could be printed in full 
detail. These are particularly useful in seeking the 
causes of conflicts which slow the computer. 

Results of Parameter Studies 

When the simulator program was completed, we 
undertook a series of studies in which the main 
parameters describing the STRETCH system were 
varied one or two at a time in order to get a measure 
for the importance of different effects. After this we 
began to specialize the studies towards answering 
specific questions in the STRETCH design. 

[Fig. 11 flow diagram boxes, in order:]

 1 INITIALIZATION
 2 ARITHMETIC UNIT
 3 DECODE OPERATIONS
 4 VIRTUAL MEMORY
 5 INDEXING ARITHMETIC UNIT
 6 BUS FROM MEMORY
 7 BUS TO MEMORY
 8 I/O REFERENCES TO MEMORY
 9 VM STORE REFERENCES TO MEMORY
10 VM FETCH REFERENCES TO MEMORY
11 IAU REFERENCES TO MEMORY
12 INSTRUCTION FETCH REFERENCES TO MEMORY
13 COUNT-DOWN TIME
   PRINT DETAILED LISTING
   SUMMARIZE AND PRINT

Fig. 11 — SIM — 2 simplified flow diagram. 

The simplified flow diagram in Fig. 11 indicates
the order in which the subroutines for the various
logical units are executed at each time step. Using
the types of techniques just described above, the
logical subroutines simulate the action of the
components of the computer such as the virtual memory,
arithmetic unit, etc.

Some Results of the Simulation Studies 

Fig. 12 shows examples of the type of output list- 
ings given by the simulator. Fig. 12 is a piece of a long 
timing chart with each line of printing representing 
0.1 microsecond of time. The columns represent the 
various components of the computer. On the left and 
right are timing counts subdividing each micro- 
second. On the far right are conflict indicators (C on 
the charts) and waiting indicators, W, which indicate 
when interlocks prevent operations from proceeding. 



The 2nd column, //, gives the number of the 
instruction being indexed. The 4th column, AU, 
gives the number of the instruction using the arith- 
metic unit. The next four columns represent the 
instructions using the memory buses. The columns 
labeled X-, F-, and M- represent the index, fast, and
main memories. A string of X's in the columns repre- 
sents the cycle time of the memory. The number 
indicates the instruction using the memory and the 
number of times which it is repeated gives the read- 
out time of the memory. The columns L- indicate 
which instruction is located in the virtual memory 
levels. The other columns are for details in analysis 
and need not be considered here. 

Five of the test problems used most frequently are 
described below. Other test problems were used for 
specific studies, but since the results were similar for 
all problems of a given type, we gradually discon- 
tinued using them. The following were originally 
selected as being typical of different classes of 
problems. 

(1) Mesh Problem — Part of an hydrodynamics 
problem from Los Alamos. It contains a more 
or less "average" mixture of instructions for 
scientific problems: 85% floating point instructions,
14% index modification instructions, and 
1% VFL. It is usually arithmetic unit limited. 

(2) Monte Carlo Branching Problem — Part of an 
actual Monte Carlo neutron diffusion code. It 
represents a chain of logical decisions with very 
little arithmetic in between. It contains 47% 
floating point, 15% index modification instruc- 
tions, and 36% branches of the indicator and 
unconditional types. It is largely instruction- 
access limited. 

(3) Reactor Problem — The inner loop of a neutron 
diffusion problem. It consists of 90% floating 
point arithmetic (39% of which are multiplies) 
and 10% index modification instructions. It is 
almost entirely arithmetic unit limited. 

(4) Computer Test Problem — The evaluation of a 
polynomial using computed indices. It has 
71% floating point, 10% index modification, 
6% VFL and 13% indicator branches. It is 
usually arithmetic unit limited, but not for all 
configurations. 

(5) Simultaneous Equations — The inner loop of a 
matrix inversion routine: 67% floating point 
and 33% index modification. Arithmetic and 
logic are about equally important. It is limited 
both by arithmetic and instruction-access 
speeds. 
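
For side-by-side comparison, the instruction mixes quoted above can be collected in a small table. The sketch below only transcribes the percentages given in the text; the names `INSTRUCTION_MIX`, `mix_total`, and `non_floating_share` are hypothetical, and the "non-floating share" is merely a rough indicator of how much of each problem is control and data selection, not a quantity defined in the paper.

```python
from typing import Dict

# Percentages exactly as quoted in the problem descriptions above.
INSTRUCTION_MIX: Dict[str, Dict[str, int]] = {
    "Mesh": {"floating point": 85, "index modification": 14, "VFL": 1},
    "Monte Carlo": {"floating point": 47, "index modification": 15, "branches": 36},
    "Reactor": {"floating point": 90, "index modification": 10},
    "Computer Test": {
        "floating point": 71,
        "index modification": 10,
        "VFL": 6,
        "indicator branches": 13,
    },
    "Simultaneous Equations": {"floating point": 67, "index modification": 33},
}


def mix_total(problem: str) -> int:
    """Sum of the quoted percentages for one test problem."""
    return sum(INSTRUCTION_MIX[problem].values())


def non_floating_share(problem: str) -> int:
    """Percentage of instructions that are not floating point arithmetic."""
    return 100 - INSTRUCTION_MIX[problem]["floating point"]


if __name__ == "__main__":
    for name in INSTRUCTION_MIX:
        print(f"{name:24s} total={mix_total(name):3d}%  "
              f"non-floating={non_floating_share(name):2d}%")
```

The quoted Monte Carlo figures add up to 98%, so 2% of that mix is not accounted for in the text; the other four mixes sum to 100%.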

Speed vs. Number of Levels of Virtual Memory 

Fig. 12 — Listing of simulator print-out.

Fig. 13 shows the effect on computer performance
of varying the number of levels of virtual memory.
Curves for the Monte Carlo and Mesh Calculations
with two sets of arithmetic and indexing arithmetic
speeds are shown. The AU times given are averages
for all operations.



[Fig. 13 curves: Mesh calc. with AU time 0.64 µs, IAU time 0.6 µs;
Mesh calc. with AU time 1.28 µs, IAU time 1.4 µs;
Monte Carlo calc. with AU time 0.64 µs, IAU time 0.6 µs;
Monte Carlo calc. with AU time 1.28 µs, IAU time 1.4 µs.
Abscissa: no. of levels of look-ahead.]

Fig. 13 — Computer speed vs. no. of levels of look-ahead registers;
4 main mems. 2.0 µs; 2 fast mems. 0.6 µs for two sets of arith.
speeds.



A number of interesting results are apparent from
these curves:

(1) There is a tremendous gain to be had in going
to the virtual memory organization. The point
for "0 levels" means that the arithmetic unit
is tied directly to the instruction preparation
unit, although simple Indexing-Execution overlap
is still possible.

(2) The gain in performance goes up very rapidly
for the first two levels, then rises more slowly
for the rest of the range.

(3) A large number of levels does the Monte Carlo
problem less good than the Mesh problem
because constant branching in the former
spoils the flow of instructions. Notice that the
curve for the Monte Carlo problem actually
decreases slightly beyond six levels. This phenomenon
is a result of memory conflicts caused
by extraneous memory references started by
the computer running ahead on the wrong-way
paths of branches.

(4) The computer performance on a given problem
is clearly less for slower arithmetic speeds.
However, it is important to note that the
sensitivity of the performance is also less for
slower arithmetic speeds. The virtual memory
improves the performance in either case, but
it is not a substitute for a fast arithmetic unit.

Speed vs. Number of Main Memory Units

Fig. 14 shows how internal computer performance
varies with the total number of memory units for a
particular problem. The entire calculation is assumed
to be contained in memory for all cases. The speed
gain from overlapping memories is quite apparent
from the graphs.

[Fig. 14 curves: Mesh calc. with regular separate 0.6 µs fast mem.;
Mesh calc. with data and instr. sharing same 2.0 µs main mem. boxes;
Monte Carlo with regular separate 0.6 µs fast mem.;
Monte Carlo with separate 2.0 µs instr. mem.;
Monte Carlo with data and instr. sharing same 2.0 µs main mem. boxes.
Abscissa: no. of main memory boxes.]

Fig. 14 — Computer speed vs. number of main memory boxes:
4 level LA; 0.6 µs IAU time; 0.64 µs AU time.

The speed differential between having and not 
having instructions separated from data arises from 
delays in instruction fetches caused by the memory 
units being busy with data. The size of this effect 
varies from problem to problem, being less pro- 
nounced for problems which are arithmetic limited 
and more for logical problems. 

The X's on the graph show the effect of replacing
the 0.6 µsec instruction memories by a pair of 2.0
µsec memories. The resulting performance change is
small for the Mesh problem, which is arithmetic
limited, but large for the instruction-fetch limited
Monte Carlo problem.
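
The gain from overlapping memory boxes discussed in this section can be illustrated with a toy model. This is a sketch under stated assumptions, not the simulator's logic: consecutive references are assumed to go to consecutive boxes, each box stays busy for one full cycle (2.0 µs, as in the figures), and a new reference can be issued at most every `issue_us` microseconds. The function name `finish_time` and the 0.5 µs issue interval are hypothetical.

```python
from typing import List


def finish_time(n_accesses: int, n_boxes: int,
                cycle_us: float = 2.0, issue_us: float = 0.5) -> float:
    """Time at which the last of n_accesses sequential references completes.

    Reference i is sent to box i mod n_boxes. It starts as soon as both the
    issuing unit and the target box are free.
    """
    box_free: List[float] = [0.0] * n_boxes  # time at which each box is next free
    next_issue = 0.0                         # earliest time the next reference may issue
    finish = 0.0
    for i in range(n_accesses):
        box = i % n_boxes
        start = max(next_issue, box_free[box])
        box_free[box] = start + cycle_us
        finish = box_free[box]
        next_issue = start + issue_us
    return finish


if __name__ == "__main__":
    for boxes in (1, 2, 4):
        print(boxes, "box(es):", finish_time(8, boxes), "microseconds")
```

Under these assumptions eight references take 16.0 µs with one box, 8.5 µs with two, and 5.5 µs with four. The model ignores instruction/data contention, which is exactly the effect the shared-memory curves of Fig. 14 measure.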

Speed vs. Arithmetic Unit and Indexing Arithmetic 
Unit Times 

Although everyone realizes the importance of 
arithmetic speed on overall computer performance, 
it was not until the simulator results became available 
that the true importance of the indexing arithmetic 
speeds was recognized. Figs. 15 and 16 show a two- 
parameter family of curves giving the computer 
speed as a function of the AU and IAU times. 

Fig. 16, in which the arithmetic time is the abscissa,
shows an interesting "saturation" effect where the
computer performance is independent of AU speed
below some critical value. Thus it makes no sense to
strain AU speeds if the IAU is not improved to
match. The curves in Fig. 15 show the same effect,
i.e., the IAU speed serves as a "ceiling" on performance
beyond which the AU speed cannot pass.
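
The saturation effect can be reproduced qualitatively by a deliberately crude model. The sketch below assumes perfect overlap between indexing and execution and no memory delays, so that in steady state each instruction costs the larger of the two unit times. This is an idealization introduced here for illustration, not the model used in the simulator, and the function names are hypothetical.

```python
def time_per_instruction(au_us: float, iau_us: float) -> float:
    """Steady-state time per instruction when AU and IAU overlap perfectly."""
    return max(au_us, iau_us)


def relative_speed(au_us: float, iau_us: float) -> float:
    """Instructions per microsecond under the idealized overlap model."""
    return 1.0 / time_per_instruction(au_us, iau_us)


if __name__ == "__main__":
    iau = 0.6  # indexing arithmetic time held fixed, in microseconds
    for au in (1.28, 0.64, 0.6, 0.3, 0.15):
        print(f"AU {au:4.2f} us, IAU {iau:3.1f} us -> "
              f"{relative_speed(au, iau):.3f} instr/us")
```

Once the AU time drops below the IAU time the computed speed stops improving, which is the "ceiling" described above. The simulated curves bend more gradually, because real instruction streams mix operations of different lengths and compete for memory.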
[Fig. 15 labels: base case, data and instrs. mixed in 4 mems.;
curve families for Mesh calc. and Monte Carlo calc.
Abscissa: indexing arithmetic time (µsec), the average time to
index one instruction incl. decode and storing modified addr.]

Fig. 15 — Computer speed vs. indexing arith. times for various
arithmetic unit times: 4 main mems. 2.0 µs; 2 fast mems. 0.6 µs;
4 levels of look-ahead.

[Fig. 16 labels: curve families for Mesh calc. and Monte Carlo calc.
Abscissa: average arithmetic time (µsec), the execution time for
the "average" operation.]

Fig. 16 — Computer speed vs. arithmetic times for various indexing
arithmetic unit times: 4 main mems. 2.0 µs; 2 fast mems. 0.6 µs;
4 levels of look-ahead.

Arithmetic Unit Efficiency

One fallacy which is frequently quoted is that the
goal of improved computer organization is to increase
the arithmetic unit efficiency. Actually there are two
reasons why this is not the goal in itself. The first is
that arithmetic efficiency depends strongly on the
mixture of arithmetic and logic in a given problem
so that a general purpose computer cannot hope to
give equally high percentage utility to all. The second
reason is that the simplest way to increase the arithmetic
unit efficiency in any asynchronous case is to
slow down the arithmetic unit.

The real goal in improved organization is maximum
overall computer performance for minimum
cost. One will tend to increase the arithmetic unit
speed as long as its percent efficiency is reasonable
for a variety of problems. One will stop this process
when the overall performance gain no longer matches
the increase in hardware and complexity. Thus the
arithmetic unit efficiency is a by-product of this
design process, not the prime variable.

Speed vs. Concurrent Input-Output Activity

Because of the relative time scales of I/O activity
and the CPU processing speeds, the simulator cannot
take into account the effect of the availability or
non-availability of data from I/O on the program being
run. However, we can observe the effect on the computation
of the I/O devices operating at different rates
simultaneously with computing.

Using the STRETCH control word philosophy, it
is possible to have a number of input-output units
operating at the same time the Central Processing
Unit is running. The Basic Exchange can reach a
peak rate of 1 word every 10 microseconds. The high
speed disk normally operates at 1 word every 4
microseconds. Since the mechanical devices take
priority over the CPU in addressing memory, the
computation slows down because of memory-busy
conflicts.

Fig. 17 shows an example of how internal computing
speed is slowed as the I/O word rates are varied
continuously. At the theoretical "choke off," the I/O
devices take all the memory cycles available and stop
the calculation. Notice that this condition can never
arise for any I/O rates presently attainable.



[Fig. 17 labels: percentage marks at 30%, 40%, 50%, 70%, and 90%.
Abscissa: word rate, microseconds between consecutive words.]

Fig. 17 — Internal computing speed. Percentage reduction in speed 
caused by input-output devices referencing memory at different 
rates while the calculation is proceeding. 

A STRETCH system with only 1 or 2 memory 
units has less performance than a larger one for three 
reasons: (1) The top speed of the system is reduced 
by the loss of memory overlap, (2) it has a larger 
I/O penalty when I/O is run concurrently with the 
computation, and (3) the smaller amount of data 
which can be held in the memory at one time increases 
the amount of I/O activity needed to do the job. 
Note, however, that increasing the memory size on a 
computer of conventional organization only improves 
the third area. 
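
A back-of-the-envelope estimate shows why the I/O penalty grows as memory units are removed. The sketch below assumes that I/O references are spread evenly over the main memory boxes and that each reference occupies a box for one full 2.0 µs cycle. Both are simplifying assumptions made here, the function names are hypothetical, and the result is the share of memory cycles taken by I/O, not the slowdown plotted in Fig. 17.

```python
def memory_share_taken(word_interval_us: float, n_boxes: int = 4,
                       cycle_us: float = 2.0) -> float:
    """Fraction of all memory cycles consumed by one I/O stream (capped at 1)."""
    return min(1.0, cycle_us / (word_interval_us * n_boxes))


def choke_off_interval(n_boxes: int = 4, cycle_us: float = 2.0) -> float:
    """Word interval at which I/O alone would use every memory cycle."""
    return cycle_us / n_boxes


if __name__ == "__main__":
    for boxes in (1, 2, 4):
        exchange = memory_share_taken(10.0, boxes)  # Basic Exchange peak rate
        disk = memory_share_taken(4.0, boxes)       # high speed disk
        print(f"{boxes} box(es): exchange {exchange:.1%}, disk {disk:.1%}, "
              f"choke-off at {choke_off_interval(boxes):.2f} us/word")
```

Under these assumptions the disk takes 12.5% of the cycles of a four-box memory but 50% of a single box, and the choke-off interval for four boxes (0.5 µs per word) lies well below either device's 4 µs or 10 µs word interval. The text gives no numerical choke-off rate, so these values follow from the stated assumptions only.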

A Study of Branching on Arithmetic Results in Stretch 

One penalty of the non-sequential preparation and 
execution of instructions used in STRETCH is that 
if there is a branch in the problem code it spoils the 
smooth flow of instructions to the indexing arith- 
metic unit. Any branch in a program will cause some 
delay, but the most serious ones are the branches on 
arithmetic results which cannot be detected by the 
indexing arithmetic unit in advance. 

There are two fundamental ways in which branches 
on arithmetic unit results can be handled by the 
computer. 

(1) The computer can stop the flow of instructions 
until the arithmetic unit has completed the 
preceding operation so that the result is 
known, then fetch the next correct instruction. 
This places a delay on every AU result branch 
whether taken or not. 

(2) The computer can "guess" which way the 
branch is going to go before it is taken and 
proceed with fetching and preparing the in- 
structions along one path with the under- 
standing that if the guess was wrong, these 
instructions must be discarded and the correct 
path taken instead. 
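
The trade-off between the two approaches can be made concrete with a simple expected-loss calculation. The sketch below is illustrative only: the cost parameters (`wait`, `taken_cost`, `mispredict_cost`) are hypothetical quantities in arbitrary time units, not figures from the simulator, and the guessing strategy is split into two variants, one assuming the branch is not taken and one assuming it is.

```python
def loss_hold(p_taken: float, wait: float, taken_cost: float) -> float:
    """Method (1): every branch waits for the AU, taken branches also pay taken_cost."""
    return wait + p_taken * taken_cost


def loss_guess_not_taken(p_taken: float, mispredict_cost: float) -> float:
    """Method (2), guessing 'not taken': time is lost only when the branch is taken."""
    return p_taken * mispredict_cost


def loss_guess_taken(p_taken: float, taken_cost: float,
                     mispredict_cost: float) -> float:
    """Method (2), guessing 'taken': pay taken_cost if right, mispredict_cost if wrong."""
    return p_taken * taken_cost + (1.0 - p_taken) * mispredict_cost


def break_even_p(taken_cost: float, mispredict_cost: float) -> float:
    """Taken-probability above which guessing 'taken' beats guessing 'not taken'."""
    return mispredict_cost / (2.0 * mispredict_cost - taken_cost)


if __name__ == "__main__":
    t, m, w = 1.0, 2.0, 1.5  # hypothetical costs
    for p in (0.1, 0.5, 0.9):
        print(f"p={p:.1f}: hold={loss_hold(p, w, t):.2f}  "
              f"guess not-taken={loss_guess_not_taken(p, m):.2f}  "
              f"guess taken={loss_guess_taken(p, t, m):.2f}")
    print("break-even p:", break_even_p(t, m))
```

With these sample costs, guessing "not taken" loses less time unless the branch is taken more than two times in three. The closer `taken_cost` comes to `mispredict_cost`, the closer the break-even probability moves toward 1.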

A detailed series of simulator runs were made 
to study this situation and to decide which way 
STRETCH should be designed. Some of the general 
observations were: 

(1) The performance of a problem with
considerable arithmetic data branching can
vary by approximately ±15% depending on
the way in which the branches are handled.

(2) Holding-up on every branch seems to be less 
desirable than any of the guessing procedures. 
Some time is lost whenever a branch is exe- 
cuted rather than proceeding to the next 
instruction. Unless there is an unusual situation 
in which there is a very large probability that the 
branch will always be taken, the least time will 
be lost if one assumes that the branch is not 
taken. 

(3) The theoretically highest performance would
be obtained if each branch had an extra "guess
bit" which would permit the programmer to
specify which way he estimates each branch
will most likely go. However, this would place
a considerable extra burden on the programmer
for the gains promised. (It also uses up many
valuable OP codes.)

(4) It is realized that there is a "feedback" in such
decisions because the way in which the machine
guesses the branches will influence future programmers
to write their codes to take advantage
of the speed gain. The result is that the
statistics of the future will be biased in favor
of the system chosen for the machine, and thus
"prove" that it was the right decision.

Discussion

M. Rubinoff: What happens to the look-ahead process if a sequence 
of branch instructions is programmed, such as in the binary selection 
of one of many subroutines? An example is the selection of the desired 
piece of a piece-wise function approximation. 

Dr. Kolsky: If it is an unconditional branch then it takes a correct 
path. 

Mr. Rubinoff: These are conditional? 

Dr. Kolsky: The machine makes the assumption that the branch is 
not taken. If the path is not taken then the branch time is covered up. 

M. S. Maxwell (US Naval Weapons Lab.): Discuss maintenance on 
diagnostic programs to insure proper operation of virtual memory. 

Dr. Kolsky: The STRETCH machine has as one of its unusual 
features a part of the interrupt system capable of recording the status 
of the machine at the instant the interrupt occurs, so that one gets a
"snapshot" of the machine as of that moment. This occurs so you do
not have to go back and duplicate the error by running the program
over and over again. I think you can see by the way the virtual
memory operates that it would be very difficult to duplicate the error
again. This feature, whereby a snapshot is made at the time the
error occurs, enables the engineers to go over the records and deter-
mine exactly what it was that caused the failure. Of course, the 
machine has a very elaborate checking mechanism as was described 
by Erich Bloch in his paper yesterday. 

J. Anderson (Burroughs): Is the addressing of STRETCH's main 
memory sequential within a memory unit or sequential across several 
memory units? 

Dr. Kolsky: The Los Alamos machine has six memories. Two are 
alternating and the other four are sequential across all four. 

R. MacIntyre (Bausch & Lomb): Is the virtual memory addressable 
in case of a branch? 

Dr. Kolsky: No, it is completely unavailable to the programmer. You 
can see that one would get into some rather tricky logical problems 
if it could be addressed. We discussed this at length and one gets 
into a terrible spider web of logical complications when one does that. 



